This report explores a dataset of 4,898 white wines with 11 variables
quantifying the chemical properties of each wine. At least 3 wine experts
rated the quality of each wine, providing a rating between 0 (very bad) and
10 (very excellent).
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
We dropped the column X, which appears to be the row number. quality is an
ordered categorical variable with scores between 0 and 10. Interestingly, our
wine experts did not give any wine an extreme score of 0 (very bad), 1, 2, or
10 (very excellent). The actual range is from 3 to 9 with a median of 6. The
remaining variables are continuous, which makes sense since they represent
the amount of the corresponding substance in the wine, based on
physicochemical tests.
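A minimal sketch of this cleanup step, not the original code. The data frame name `wines` and the three-row toy data (taken from the first rows of the `str()` output above) are assumptions made for illustration, as is the choice to declare all levels from 0 to 10.

```r
# Toy stand-in for the wine data; the name `wines` is an assumption.
wines <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.0, 6.3, 8.1),
  quality = c(6L, 6L, 6L)
)

# Drop the row-number column X.
wines$X <- NULL

# Treat quality as an ordered categorical variable on the 0-10 scale.
wines$quality <- factor(wines$quality, levels = 0:10, ordered = TRUE)

str(wines)
```

Declaring the full 0-10 scale keeps unused scores such as 0 or 10 visible as empty levels. Leaving out `levels` would keep only the scores that actually occur.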
From this histogram of quality counts, we can see that the distribution is
roughly normal, with the mean (solid line) and median (dashed line) at almost
the same value. Most of the white wines have a quality of 6, and 5 is the
second most common rating.
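The mean and median can be recovered from the quality counts alone. This is a small sketch using the counts printed in the table above; the variable names are illustrative.

```r
# Quality scores and their counts, copied from the table above.
quality <- 3:9
n <- c(20, 163, 1457, 2198, 880, 175, 5)

# Expand the counts back into one rating per wine.
ratings <- rep(quality, times = n)

mean_quality <- mean(ratings)
median_quality <- median(ratings)

round(mean_quality, 3)  # 5.878, matching the summary output
median_quality          # 6
```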
##
## FALSE TRUE
## 3838 1060
##
## midiocre premium
## 3838 1060
I remember my wine teacher always talking about the Pareto principle (80/20
rule) in the wine industry (yes, I had a wine teacher). Wines of quality 7, 8,
or 9 make up 21.6% of the white wines rated (1,060 of 4,898). Therefore, we
will consider wines of quality 7, 8, or 9 as premium.
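The premium share can be checked from the quality counts in the table above. This is a sketch with illustrative variable names, not the original code.

```r
# Quality counts copied from the table above, named by score.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

premium <- sum(counts[c("7", "8", "9")])  # wines rated 7, 8, or 9
total <- sum(counts)                      # all wines rated

share <- premium / total
round(100 * share, 1)  # 21.6 (percent of all wines)
```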
##
## FALSE TRUE
## 4613 285
All wines contain sulphur dioxide in various forms, collectively known as
sulphites. Even in completely unsulphured wine it is present at concentrations
of up to 10 mg/L. Commercially made wines contain from ten to twenty times
that amount. (Source: morethanorganic)
Reasons why SO2 is not desirable in wine:
According to EU law, the maximum permitted level of SO2 in white/rosé wine
is 210 mg/L. As you can see in the first histogram, 285 wines exceed this
limit. We can also observe that all three histograms have a right-skewed
distribution. This might be due to the legal restriction on sulphites: most
vineyards would obey the rules and avoid exceeding the limit.
## [1] 3.188267
## [1] 3.18
In this set of histograms, we explore the acidity of the wines. The first
three variables are the amounts of the corresponding acids found in the wines.
The fourth variable, pH, indicates the acidity level, where 7 is neutral and
the smaller the value, the more acidic the liquid. We observe a right-skewed
distribution for the first three and a roughly normal distribution for pH,
with mean 3.19 and median 3.18 (acidic). It makes sense that the pH histogram
is not right-skewed like the three above, since the outliers in the acidity
histograms would have a lower pH value (the tail on the left of the pH
histogram).
Some people believe that the sweeter a wine is, the more alcohol it should
contain. We cannot tell this just by looking at the histograms here. We will
look more into it in the bivariate plot section. Here we can see that both
residual sugar and salt (chlorides) have very right-skewed distributions. The
amount of salt is tiny for all white wines, with a maximum of 0.346 g/L. The
histogram of alcohol is slightly right-skewed with peaks at around 9-9.5%;
away from the peaks it is fairly uniform. Most wines have an alcohol level of
8.5-12%.
We take residual sugar alone and add the type feature to compare the
distributions for wines of both types. A log transformation is used here to
make the right-skewed distribution look closer to normal.
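To illustrate why a log transformation helps, here is a small sketch on synthetic data. The values are simulated from a log-normal distribution as a stand-in for residual sugar and are NOT the wine data; the helper `skewness` is defined here and does not come from a package.

```r
set.seed(42)

# Synthetic right-skewed values standing in for residual.sugar (assumption).
sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Sample skewness: the third standardized moment.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(sugar)         # clearly positive: long right tail
skew_log <- skewness(log10(sugar))  # close to zero: roughly symmetric

c(raw = skew_raw, log10 = skew_log)
```

In a ggplot2 histogram the same effect is commonly obtained with a log-scaled x axis rather than by transforming the column itself.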
We can see that the shapes are similar overall, with most wines having between
1 and 20 g/dm3 of residual sugar. Mediocre wines have a clearly bimodal
distribution with peaks at 1.7 and 8. The distribution of premium wines has
three peaks, at 1.7, 5, and 13.5, but the differences between its local minima
and maxima are not as extreme as those for mediocre wines.
The white wine quality dataset consists of 4898 observations and 12 variables.
Each observation is a white variant of the Portuguese “Vinho Verde” wine.
Among the 12 variables, there are 11 input variables (numeric) which represent
the amount of corresponding substance existing in the wine based on
physicochemical tests. The output variable quality is based on sensory data
(median of at least 3 evaluations made by wine experts), and it is an ordered
categorical variable ranging from 0 (very bad) to 10 (very excellent).
The main feature of interest is quality. I am curious to know how the amounts
of the other factors affect the ratings from the wine experts.
From reading the description of the dataset, I think volatile acidity,
citric acid, free sulfur dioxide, total sulfur dioxide, and density may
support my investigation, because they seem to affect smell, taste, and color.
density may contribute to the effect of “wine curtains”, which is also an
essential part of wine tasting.
Yes, I created a new categorical variable type to indicate whether a wine
is premium or mediocre, where premium wines are the ones rated 7 or above in
quality and mediocre wines are the rest.
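A sketch of how such a variable could be derived, assuming the cutoff of quality 7 and higher that yields the 1,060 premium wines counted earlier. The toy vector and the label spellings are illustrative, not the original code.

```r
# A few example quality scores (toy data, one per observed level).
quality <- c(3, 4, 5, 6, 7, 8, 9)

# Label wines rated 7 or higher as premium, the rest as mediocre.
type <- factor(
  ifelse(quality >= 7, "premium", "mediocre"),
  levels = c("mediocre", "premium")
)

table(type)
```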
I dropped the column X, which appears to be the row number. I also changed
quality to an ordered categorical variable with scores between 0 and 10.
Many variables have a right-skewed distribution with outliers at the far end
of the tail. However, quality is quite normally distributed, with no extreme
values like 0, 1, 2, or 10. I haven’t removed the “outliers” from the dataset
because at this point I am not sure whether their extremeness contributes to
the feature of interest.
This scatterplot matrix with all variables gives a broad overview of which
variables might be interesting.
This correlation plot shows the correlations between variables more clearly.
An interesting observation is that outliers occur more often at the
middle-range qualities (5, 6, 7) than at the extremes. Very few outliers can
be observed for 9-quality or 3-quality wines.
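For context, boxplots conventionally mark as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the overall residual.sugar quartiles from the summary output above. The boxplots discussed here are drawn per quality level, so each box has its own fences; the overall quartiles are used only for illustration.

```r
# Quartiles of residual.sugar, copied from the summary() output above.
q1 <- 1.7
q3 <- 9.9
iqr <- q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond the quartiles.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# The maximum residual.sugar in the summary is 65.8.
max_sugar <- 65.8
is_outlier <- max_sugar > upper_fence

c(lower = lower_fence, upper = upper_fence)
is_outlier
```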
If you look at the boxplot at quality 9 for each factor, notice that the “box”
is generally smaller than at other qualities (especially for density and
sulfur.dioxide). This suggests that a wine needs a specific set of
characteristics in order to be rated as a “very excellent” quality Portuguese
“Vinho Verde” white wine. At this point, I’m impressed by the wine experts who
rated these wines. Just by blind tasting, they can detect the excellent wines
with exactly the right amount of each substance.
I really like this boxplot of alcohol. To reach a quality of 9, the alcohol
level has to be close to 12.5%. For wines at the other quality levels,
however, the alcohol level can vary by as much as 6 percentage points.
Overall, from quality level 5 upward, there is a trend: the more alcohol, the
better the wine.
Looking at this scatterplot of residual.sugar against density, we can spot a
positive correlation between the two variables. An outlier can also be
spotted. It seems to be a valid data point with an extreme value, since it
still respects the correlation between density and residual.sugar; it just has
an extremely high residual.sugar level. It must be a really sweet wine. We
will eliminate this outlier from our plot and subset our data to wines with
residual.sugar less than 30 g/dm3.
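The subsetting step might look like the sketch below. The data frame name `wines` is an assumption, and the toy values are a few residual.sugar readings from the `str()` output plus the maximum (65.8) from the summary.

```r
# Toy data: four typical readings and the extreme maximum of 65.8.
wines <- data.frame(residual.sugar = c(20.7, 1.6, 6.9, 8.5, 65.8))

# Keep only wines with residual.sugar below 30 g/dm3.
wines_sub <- subset(wines, residual.sugar < 30)

nrow(wines)      # 5
nrow(wines_sub)  # 4
```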
Something interesting is happening here. Overall, the linear smoothing line
fits the scatterplot well. However, the lower the residual.sugar level, the
wider the range of density at that level. My guess is that when the
residual.sugar level is low, density is correlated with a third variable.
From the set of boxplots, we can observe that alcohol seems to be
appreciated: with a higher alcohol level, the median quality rating is
generally higher.
pH, fixed.acidity, and citric.acid show a slight positive correlation as well.
On the other hand, sulfur.dioxide, sugar, and density are not appreciated;
they are negatively correlated with quality.
We can observe that there is a strong correlation (0.853) between density and
residual.sugar, which is what I suspected before.
It’s only natural to see that free.sulfur.dioxide and total.sulfur.dioxide
have a correlation of 0.61.
Also as suspected before, sugar and density have a strong correlation
of 0.839. All other factors contribute to density to some degree: the
correlations of density with the other factors range from 0.15 to 0.839,
except for volatile acidity (corr: 0.0271).
Surprisingly, alcohol and residual.sugar have a negative correlation
of -0.427. alcohol and density also have a strong negative correlation of
-0.711, which makes sense since density and residual.sugar are highly
positively correlated.
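As a small illustration of how such a coefficient is computed, the sketch below applies `cor()` to the first ten residual.sugar and alcohol values shown in the `str()` output. Ten rows will not reproduce the figures quoted above; the example only shows the call and the sign of the relationship.

```r
# First ten values of each variable, copied from the str() output above.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Pearson correlation (the default method of cor()).
r <- cor(residual.sugar, alcohol)
r
```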
From the boxplots on the quality column, we suspect that alcohol,
total.sulfur.dioxide, and density have some effects on the ratings of
wine quality by the wine experts.
The strongest relationship I found is between residual.sugar and density.
They have a correlation of 0.853. density and alcohol also have a strong
negative correlation of -0.78.
## [1] 9 440
Following the previous analysis of residual.sugar and density,
total.sulfur.dioxide is added as a feature here. I first cut the variable
into 4 buckets: (0, 100], (100, 150], (150, 210], (210, 440]. My earlier
guess was correct. At the same residual.sugar level, wines with a lower
total.sulfur.dioxide level are less dense.
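The bucketing can be sketched with `cut()`, using the break points listed above. The ten input values are the first total.sulfur.dioxide readings from the `str()` output; the variable name `bucket` is illustrative.

```r
# First ten total.sulfur.dioxide values from the str() output above.
total.sulfur.dioxide <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Right-closed intervals (0,100], (100,150], (150,210], (210,440].
bucket <- cut(total.sulfur.dioxide, breaks = c(0, 100, 150, 210, 440))

table(bucket)
```

The last bucket, (210,440], holds the wines above the 210 mg/L limit; none of these ten rows fall into it.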
To make this set of plots, outliers (residual.sugar > 30) are removed from
the dataset.
We can see that the strong correlation between density and sugar holds
regardless of quality.
In the second plot, we can see that at the same level of sugar, premium wines
are less dense than mediocre wines. Mediocre wines also have a bigger range
of residual.sugar levels (the outliers we didn’t show are also mediocre
wines).
total.sulfur.dioxide and density are not as strongly correlated as sugar and
density, but we can observe the same trend: the line of fit for premium wines
is lower than that for mediocre wines.
sugar and density seem to strengthen each other when looking at quality. At
the same sugar level, premium wines tend to be less dense than mediocre
wines. Wines with an extremely high sugar level have a lower chance of being
rated as excellent.
The extreme values in features like residual.sugar and sulfur.dioxide occur
mostly among wines at quality levels 5, 6, and 7. This is surprising, as these
are not rated as bad wines (levels 2, 3) but as OK wines.
This is a histogram of quality counts of the wines. The dashed line is the
median and the solid line is the mean. We can see that the distribution is
roughly normal, with the mean (5.88) and median (6) close together.
This plot has 3 dimensions: residual.sugar, total.sulfur.dioxide.bucket
(cut from total.sulfur.dioxide), and density. We can observe the positive
correlation between residual.sugar and density. At the same residual.sugar
level, wines with a lower total.sulfur.dioxide level are less dense.
From this plot, we can see that at the same level of sugar, premium wines are
less dense than mediocre wines. Mediocre wines also have a bigger range of
residual.sugar levels (the outliers we didn’t show here are also mediocre
wines).
At the beginning it was hard to understand what each numeric variable means
and how it could affect the quality of wine. After doing some research and
reading the documentation of the dataset more carefully, it became clearer
how I could explore this dataset. Another struggle is that the differences in
the amounts of the variables are really subtle: you can see from the
scatterplots that all the points cluster together, so it’s hard to visualize
when you just put quality as color in the same scatterplot. Maybe some
transformation of the data could be used in the future to make it possible to
visually separate the clusters.